Note: This exercise is adapted from the original here. As of September 2020, installing pandas_profiling from conda may give you an old version (1.4.1), because some conda channels lag behind the latest release on PyPI (2.9.0 as of September 2020). The exact environment and library versions used to run this exercise are listed in the Pipfile of this example here (see pipenv for more details).
Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh
The autoreload instruction reloads modules automatically before code execution, which is helpful for the update below.
%load_ext autoreload
%autoreload 2
Make sure that we have the latest version of pandas-profiling.
# # uncomment and run below if you need to pip install the pandas-profiling library
# import sys
# !{sys.executable} -m pip install -U pandas-profiling==2.9.0
# !jupyter nbextension enable --py widgetsnbextension
You might want to restart the kernel now.
conda install -c anaconda pandas-profiling
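Because conda and pip can leave different versions installed, it may be worth checking what is actually present before importing. The following is a minimal sketch using only the standard library (Python 3.8+); the helper names `installed_version` and `version_tuple` are hypothetical and not part of pandas-profiling, and the 2.9.0 threshold is simply the version mentioned in the note above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_tuple(text):
    """Turn '2.9.0' into (2, 9, 0); non-numeric parts are ignored."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


found = installed_version("pandas-profiling")
if found is None:
    print("pandas-profiling is not installed in this environment")
elif version_tuple(found) < (2, 9, 0):
    print(f"pandas-profiling {found} is older than 2.9.0; consider upgrading via pip")
else:
    print(f"pandas-profiling {found} is recent enough for this exercise")
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "10.0" would sort before "2.9".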
from pathlib import Path
import requests
import numpy as np
import pandas as pd
import pandas_profiling
from pandas_profiling.utils.cache import cache_file
We load the data and add some fake variables to illustrate pandas-profiling's capabilities.
file_name = cache_file(
"meteorites.csv",
"https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
# 'https://data.nasa.gov/resource/gh4g-9sfh.csv',
)
print(file_name)
df = pd.read_csv(file_name)
# Note: with nanosecond resolution, pandas cannot represent dates before 1677-09-21, so out-of-range years are coerced to NaT and ignored in this analysis
df['year'] = pd.to_datetime(df['year'], errors='coerce')
# Example: Constant variable
df['source'] = "NASA"
# Example: Boolean variable
df['boolean'] = np.random.choice([True, False], df.shape[0])
# Example: Mixed with base types
df['mixed'] = np.random.choice([1, "A"], df.shape[0])
# Example: Highly correlated variables
df['reclat_city'] = df['reclat'] + np.random.normal(scale=5, size=len(df))
# Example: Duplicate observations
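duplicates_to_add = df.iloc[0:10].copy()  # copy so the rename below does not touch df
duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
df = pd.concat([df, duplicates_to_add], ignore_index=True)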
df
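As a quick sanity check on the synthetic columns, the same steps can be replayed on a small toy frame. This is only a sketch: the `toy` DataFrame, its made-up values, and the seeded `rng` are assumptions for illustration and do not come from the meteorite data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Toy stand-in for the meteorite table (values are made up)
toy = pd.DataFrame({
    "name": [f"rock {i}" for i in range(1000)],
    "reclat": rng.uniform(-90, 90, size=1000),
})

# Constant column and noisy copy of the latitude, as in the cell above
toy["source"] = "NASA"
toy["reclat_city"] = toy["reclat"] + rng.normal(scale=5, size=len(toy))

# Copy the first 10 rows, rename them, and append them
extra = toy.iloc[0:10].copy()
extra["name"] = extra["name"] + " copy"
toy = pd.concat([toy, extra], ignore_index=True)

# A constant column has exactly one distinct value
n_unique_source = toy["source"].nunique()

# Noise with scale 5 is small next to the spread of latitudes,
# so the two columns stay strongly correlated
corr = toy["reclat"].corr(toy["reclat_city"])

# Rows repeated across all columns vs. across all columns except the name
dup_all = int(toy.duplicated().sum())
dup_ignoring_name = int(
    toy.duplicated(subset=["reclat", "source", "reclat_city"]).sum()
)

print(n_unique_source, round(corr, 3), dup_all, dup_ignoring_name)
```

In this toy setup the renamed copies only count as repeats when the name column is left out of the comparison, which is worth keeping in mind when reading any duplicate-row count for the appended rows.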
report = df.profile_report(sort='None', html={'style': {'full_width': True}}, progress_bar=False)
report
profile_report = df.profile_report(html={'style': {'full_width': True}})
Path("tmp").mkdir(exist_ok=True)  # make sure the output directory exists
profile_report.to_file("tmp/example.html")
profile_report = df.profile_report(explorative=True, html={'style': {'full_width': True}})
profile_report
profile_report.to_widgets()